CS109 Final Project: Nicholas Ruta, Ayin Mokrivala, and Anna Whitney

On average, roughly 6,000 tweets are posted to Twitter every second. This makes Twitter a valuable source of text for sentiment analysis: posts are mostly public and can be collected at scale, and the frequent use of hashtags makes it easier to draw conclusions from them.
Sentiment analysis of airlines intrigues us because the industry is heavily price oriented. Ticket prices from different airlines often fall in a similar range, which puts the emphasis on the quality of the travel experience. Yet a customer's understanding of an airline is usually based on personal experience or general news, so when ticket prices are alike, the customer wants to know which airline has the better reputation.
We performed sentiment analysis on tweets collected from the Twitter Streaming API in order to find the most preferred airline. We chose a 'top ten' selection of airlines based on two factors: first, an excellent article by Nate Silver and the team at fivethirtyeight.com describing their recent 'best and worst airlines' analysis; second, our initial analysis of the data collection process, which verified that a sufficient amount of data is produced for each airline. The ten airlines we ended up with were United, Alaska Air, Frontier, Hawaiian Air, Virgin America, Southwest Air, Delta, JetBlue, Spirit Airlines, and American Air.
We created a Python script to collect raw data from the Twitter Streaming API:
#MUST ADD YOUR TWITTER DEVELOPER CREDENTIALS FIRST
#you can rename this cell as get_twitter.py and run from the command line using this command -
# python get_twitter.py > twitter_data.txt
#Import the necessary methods from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
#Set the twitter developer credentials to access the Twitter Streaming API
access_token = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""
#This is a basic listener that prints received tweets to stdout.
class StdOutListener(StreamListener):
    def on_data(self, data):
        print data
        return True
    def on_error(self, status):
        print status
#THIS SECTION IS COMMENTED OUT SINCE THE DEVELOPER CREDENTIALS ARE NOT IN PLACE ABOVE
# if __name__ == '__main__':
#     #This handles Twitter authentication and the connection to the Twitter Streaming API
#     l = StdOutListener()
#     auth = OAuthHandler(consumer_key, consumer_secret)
#     auth.set_access_token(access_token, access_token_secret)
#     stream = Stream(auth, l)
#     #Use the filter to capture data from the stream by keywords -
#     stream.filter(track=['@united', '@AlaskaAir', '@FlyFrontier', '@HawaiianAir', '@VirginAmerica','@SouthwestAir','@Delta','@JetBlue','@SpiritAirlines','@AmericanAir','#united', '#AlaskaAir', '#FlyFrontier', '#HawaiianAir', '#VirginAmerica','#SouthwestAir','#Delta','#Jetblue','#SpiritAirlines','#AmericanAir'], async=True)
We combined all of the .txt files of raw data created from the above script into one clean and final .json file:
# #Uncomment if you want to combine .txt files from the above twitter streaming api process.
# #take in the combined/entire dataset of json rows
# #remove blank lines and only write the row to the 'cleanfile.json' file if it is valid JSON
# import json
# import fileinput
# import glob
# a function to verify that a row in the raw data file is valid JSON. We noticed that the Twitter Streaming API
# did occasionally return data in error and had to take this step to clean the dataset
# def is_json(myjson):
#     try:
#         json.loads(myjson)
#     except ValueError:
#         return False
#     return True
# the glob library makes it easy to grab all of the text files we placed in a tweets folder
# these were the entire collection of raw data we brought in from the Twitter Streaming API over the Nov-Dec 2015
# timeframe.
# file_list = glob.glob("tweets/*.txt")
# combined_file_name = 'combined_result.json'
# with open(combined_file_name, 'w') as file:
#     input_lines = fileinput.input(file_list)
#     file.writelines(input_lines)
# f = open('clean_final_file.json','w')
# for line in open(combined_file_name):
#     line = line.rstrip()
#     if line != '' and is_json(line):
#         f.write(line + "\n") # python will convert \n to os.linesep
# f.close()
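The same cleaning logic can be sketched as a small standalone function — a simplified, runnable version of the commented-out script above, operating on in-memory lines rather than files:

```python
import json

def clean_json_lines(lines):
    """Keep only non-blank lines that parse as valid JSON."""
    cleaned = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        try:
            json.loads(line)
        except ValueError:  # the streaming API occasionally returns malformed rows
            continue
        cleaned.append(line)
    return cleaned

raw = ['{"text": "great flight"}', '', 'not valid json', '{"text": "lost bag"}']
cleaned = clean_json_lines(raw)  # keeps only the two valid JSON rows
```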
Our project focuses on analyzing the text of over 100,000 tweets about airlines using sentiment analysis and LDA to determine which airline receives the most positive or negative attention on Twitter, and what topics people are happy or sad about with regards to each airline.
We start by importing the modules we need.
import json
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
The Twitter Streaming API returns JSON objects. We took the resulting JSON and used the Python libraries json and pandas to create a list of 'raw_data' to be used in our text processing and visualizations. Our data can be downloaded from Dropbox, and the below assumes that the JSON file containing the data is placed in the directory above the directory containing this notebook.
#create a file to contain the 'raw data' from the twitter streaming api
raw_data_path = '../tweets_all@#_103653.json'
#create a list to hold the tweets
raw_data = []
#open the raw data file
tweets_file = open(raw_data_path, "r")
#append to the tweets list from the raw data file
for line in tweets_file:
    try:
        tweet = json.loads(line)
        raw_data.append(tweet)
    except ValueError:
        continue
print "Number of raw data rows - ", len(raw_data)
print "\n"
print "Here is what a JSON row in the raw data from the Twitter Streaming API looks like - "
print "\n"
print raw_data[0]
tweets = pd.DataFrame()
We took the raw data from the Twitter Streaming API and placed it in a pandas DataFrame. The Twitter API provides many fields, so we used Python's map function with a lambda to set a DataFrame column for each field of potential interest. To keep the DataFrame clean, we set the value to 'None' wherever the Twitter Streaming API returned no value:
#removed u'possibly_sensitive', at the moment since not all rows have it
#it goes between place and retweet_count in the below list
twitter_fields = [u'contributors', u'coordinates', u'created_at', u'entities',
u'favorite_count', u'favorited', u'filter_level', u'geo', u'id',
u'id_str', u'in_reply_to_screen_name', u'in_reply_to_status_id',
u'in_reply_to_status_id_str', u'in_reply_to_user_id', u'in_reply_to_user_id_str',
u'lang', u'place', u'retweet_count', u'retweeted', u'source',
u'text', u'timestamp_ms', u'truncated', u'user']
#set the columns in the dataframe to match the json fields of the twitter streaming api
for t in twitter_fields:
    tweets[t] = map(lambda tweet: tweet[t] if tweet[t] else 'None', raw_data)
tweets['followers_count'] = map(lambda tweet: tweet['user']['followers_count'] if tweet['user'] != None else None, raw_data)
tweets['country'] = map(lambda tweet: tweet['place']['country'] if tweet['place'] != None else None, raw_data)
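The None-fill pattern for nested fields can be illustrated on a couple of hypothetical rows (using list comprehensions here rather than map, so it reads the same under Python 2 and 3):

```python
# two hypothetical raw tweets: one with user info, one with place info
raw_data = [
    {'text': 'great flight', 'user': {'followers_count': 42}, 'place': None},
    {'text': 'lost my bag', 'user': None, 'place': {'country': 'United States'}},
]
# fall back to None wherever the nested object is missing
followers = [t['user']['followers_count'] if t['user'] is not None else None for t in raw_data]
countries = [t['place']['country'] if t['place'] is not None else None for t in raw_data]
```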
We will use many of these fields to do our analysis, draw interesting conclusions, and visualize the results. For example, we are interested in how positive the sentiment of tweets from the Twitter users with the highest followers_count is. We wondered whether the top Twitter users are paid to be positive, and we wanted to see whether the data reflects this possibility:
print "We can see how many people are following the user of each tweet in the dataset - "
print tweets['followers_count'].head()
print "\n"
print "And it will be important to verify that the majority of the data is in American English since that is what we will be basing our sentiment analysis on -"
print tweets['lang'].head()
Now that we have collected the raw data and placed it in a pandas dataframe, let's take a look at some of the tweets text specifically:
#View the first 5 tweets of the dataset
pd.set_option('max_colwidth', 200)
tweets['text'].head(5)
First, we install the plotly library for visualization:
#Run these two commands in Terminal to initialize plotly online
#you may need to make plotly account before doing this
#Use this link to setup plotly: https://plot.ly/python/user-guide/
#pip install plotly
#python -c "import plotly; plotly.tools.set_credentials_file(username='nruta', api_key='mwv4tll3ev')"
#pip install cufflinks
#run these using your username and key
import plotly.tools as tls
tls.set_credentials_file(username='ayinmv', api_key='rq66z3hqx8')
import plotly.plotly as py
from plotly.graph_objs import *
!pip install plotly
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
print __version__ # requires version >= 1.9.0
#import cufflink package
import cufflinks as cf
print cf.__version__
init_notebook_mode() # run at the start of every ipython notebook to use plotly.offline
# this injects the plotly.js source files into the notebook
py.sign_in(username='ayinmv', api_key='rq66z3hqx8')
It is important to verify that the majority of the tweets are in American English since the sentiment analysis will depend on it. We are using a word list that weighs certain key nouns/adjectives based on the sentiment associated with them. It assumes American English is passed to it.
We used a histogram to verify the top language as English:
#histogram of all the languages we are detecting in our tweets
tweets_by_lang = tweets['lang'].value_counts()
tweets_by_lang.iplot(kind='bar', yTitle='Languages', title='Languages')
tls.embed('https://plot.ly/~ayinmv/97')
and another histogram to verify that the tweets are mostly from the USA:
#histogram of all the countries we are detecting in our tweets
tweets_by_country = tweets['country'].value_counts()
tweets_by_country.iplot(kind='bar', yTitle='Countries', title='Countries')
tls.embed('https://plot.ly/~ayinmv/138')
We wanted to see how many tweets were about each of the top ten airlines that we selected for the project. First, we create a function to find a word in the tweet text column of the pandas dataframe we created:
import re
#create a function to find the word in the tweet text field
def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False
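For example, the function does a case-insensitive substring match (note that because it uses re.search, any regex metacharacters in the search word would be interpreted as a pattern). The definition is repeated here so the example is self-contained:

```python
import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False

# matches anywhere in the text, regardless of case
found = word_in_text('delta', 'Great service from @Delta today!')     # True
missing = word_in_text('united', 'Great service from @Delta today!')  # False
```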
and then we used that function, word_in_text, to create new columns for each airline and include the total count for each:
#get the words in the text
airlines = ['southwest', 'delta', 'jetblue', 'united', 'flyfrontier', 'hawaiianair', 'virginamerica', 'alaskaair','spiritairlines', 'AmericanAir']
#create columns for each airline and set a boolean value to use for the below visualization of tweet count
#for each airline
for a in airlines:
    if (tweets['text'].apply(lambda tweet: word_in_text(a, tweet))).count() > 0:
        tweets[a] = tweets['text'].apply(lambda tweet: word_in_text(a, tweet))
We can get the total count by adding up all of the rows, for a particular airline column of the dataframe, that have a True value. For example, here is the total count for Southwest Airlines:
print len(tweets.loc[tweets['southwest'] == True])
#view the amount of tweets per airline
# tweets_by_airlines = [len(tweets.loc[tweets['southwest'] == True]),
# len(tweets.loc[tweets['delta'] == True]), len(tweets.loc[tweets['jetblue'] == True]),
# len(tweets.loc[tweets['united'] == True]), len(tweets.loc[tweets['flyfrontier'] == True]),
# len(tweets.loc[tweets['hawaiianair'] == True]),len(tweets.loc[tweets['virginamerica'] == True]),
# len(tweets.loc[tweets['alaskaair'] == True]),len(tweets.loc[tweets['spiritairlines'] == True]),
# len(tweets.loc[tweets['AmericanAir'] == True])]
# airlines = ['southwest', 'delta', 'jetblue', 'united', 'flyfrontier', 'hawaiianair', 'virginamerica', 'alaskaair','spiritairlines', 'AmericanAir']
# x_pos = list(range(len(airlines)))
# width = 0.3
# fig, ax = plt.subplots(figsize=(16,8))
# plt.bar(x_pos, tweets_by_airlines, width, alpha=1, color='g')
# # Setting axis labels and ticks
# ax.set_ylabel('Number of tweets', fontsize=15)
# ax.set_title('Ranking: Airlines (Raw data)', fontsize=10, fontweight='bold')
# ax.set_xticks([p + 0.9 * width for p in x_pos])
# ax.set_xticklabels(airlines)
# plt.grid()
We will need a new column in the dataframe called 'airline'. We created it for our Spark text processing and LDA topic modelling, in addition to the visualizations we will create:
#create a dataframe called 'processed_data' with extra column 'airline'
#a function to set the airline name where the column value is True
def setAirlineName(row):
    x = None
    for a in airlines:
        if row[a]:
            x = a
    return x
# set a column in the tweets dataframe for each airline name with a boolean value
for a in airlines:
    if (tweets['text'].apply(lambda tweet: word_in_text(a, tweet))).count() > 0:
        tweets[a] = tweets['text'].apply(lambda tweet: word_in_text(a, tweet))
#add the airline column with the airline name as the value
tweets['airline'] = tweets.apply(lambda row: setAirlineName(row), axis=1)
#remove the temp. columns for each airline
for a in airlines:
    tweets = tweets.drop(a, 1)
#filter to just English-language tweets, since all our language processing is English-specific
processed_data = tweets[tweets['lang'] == 'en']
#processed_data['airline'] values are - 'southwest', 'delta', 'jetblue', 'united', 'flyfrontier',
# 'hawaiianair', 'virginamerica', 'alaskaair','spiritairlines', 'AmericanAir'
print 'The Processed Data File contains', len(processed_data), 'tweets.'
This new dataframe contains the 'airline' column we need:
processed_data['airline'].head()
At one point, we thought it would be necessary to have a JSON file created from this new dataframe. We created a script to do that but eventually decided to use the dataframe directly in the Spark text processing application:
#create a json file from the processed_data pandas dataframe
#it has the extra airline field that will be used for spark processing
#This is commented out since we are using the processed_data dataframe for the spark processing
# with open('processed_data.json', 'w') as outfile:
# for index, row in processed_data.iterrows():
# outfile.write(row.to_json())
# outfile.write('\n')
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
from sklearn.feature_extraction import text
from gensim import corpora
import findspark
findspark.init()
print findspark.find()
import pyspark
# adapted from HW5
def get_parts(thetext, punc='.,;:!?()[]{}`''\"@#$^&*+-|=~_'):
    # generate stopwords list & regexes for 2+ periods or 2+ dashes
    stop = text.ENGLISH_STOP_WORDS
    regex1 = re.compile(r"\.{2,}")
    regex2 = re.compile(r"\-{2,}")
    thetext = re.sub(regex1, ' ', thetext)
    thetext = re.sub(regex2, ' ', thetext)
    punctuation = list(punc)
    nouns = []
    descriptives = []
    for i, sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            if len(token[4]) > 0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stop or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4]) == 1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stop or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4]) == 1:
                        continue
                    nouns[i].append(token[4])
    out = zip(nouns, descriptives)
    nouns2 = []
    descriptives2 = []
    for n, d in out:
        if len(n) != 0 and len(d) != 0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2
# initialize Spark context
conf = pyspark.SparkConf().setAppName("Twitter_Airline").setMaster("local[*]")
sc = pyspark.SparkContext(conf=conf)
We read all the tweets from the dataframe into Spark, assigning each tweet a unique ID so we can track it through our sentiment analysis and LDA topic modeling.
# read tweets & associated airlines into Spark
tweets_text = sc.parallelize([(row['airline'], row['text']) for index, row in processed_data.iterrows()]).zipWithIndex().map(lambda ((air, txt), idx): ((idx, air), txt))
tweets_text.take(5)
Function to read the word list file:
import numpy as np
# read the word list
def readSentimentList(file_name):
    ifile = open(file_name, 'r')
    happy_log_probs = {}
    sad_log_probs = {}
    ifile.readline() #Ignore title row
    # split each csv row into word, happy log prob, sad log prob
    for line in ifile:
        tokens = line[:-1].split(',')
        happy_log_probs[tokens[0]] = float(tokens[1])
        sad_log_probs[tokens[0]] = float(tokens[2])
    return happy_log_probs, sad_log_probs
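To make the expected CSV layout concrete, here is a self-contained sketch with a tiny hypothetical word list (this version takes an open file object instead of a file name so it can run against an in-memory string; the numbers are made up, not from the real wordlist.csv):

```python
import io

def read_sentiment_list(ifile):
    # same parsing logic as readSentimentList, but reading from an open file object
    happy_log_probs = {}
    sad_log_probs = {}
    ifile.readline()  # ignore title row
    for line in ifile:
        tokens = line.strip().split(',')
        happy_log_probs[tokens[0]] = float(tokens[1])
        sad_log_probs[tokens[0]] = float(tokens[2])
    return happy_log_probs, sad_log_probs

sample = io.StringIO("word,happy_log_prob,sad_log_prob\n"
                     "love,-2.3,-5.1\n"
                     "hate,-6.0,-1.9\n")
happy, sad = read_sentiment_list(sample)  # happy['love'] == -2.3
```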
We classify each tweet's sentiment using a Naive Bayes rule:
def classifySentiment(words, happy_log_probs, sad_log_probs):
    # get the log-probability of each word under each sentiment
    happy_probs = [happy_log_probs[word] for word in words if word in happy_log_probs]
    sad_probs = [sad_log_probs[word] for word in words if word in sad_log_probs]
    # sum all the log-probabilities for each sentiment to get a log-probability for the whole tweet
    tweet_happy_log_prob = np.sum(happy_probs)
    tweet_sad_log_prob = np.sum(sad_probs)
    # calculate the probability of the tweet belonging to each sentiment
    prob_happy = np.reciprocal(np.exp(tweet_sad_log_prob - tweet_happy_log_prob) + 1)
    prob_sad = 1 - prob_happy
    return prob_happy, prob_sad
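With equal priors, the two-class posterior reduces to a logistic function of the difference between the summed log-likelihoods, which is exactly what the np.reciprocal line computes. A dependency-free sketch with toy log-probabilities (hypothetical numbers, not from the real word list):

```python
import math

happy_log_probs = {'love': -2.0, 'flight': -4.0}  # toy values
sad_log_probs = {'love': -6.0, 'flight': -4.0}

def classify(words):
    happy = sum(happy_log_probs[w] for w in words if w in happy_log_probs)
    sad = sum(sad_log_probs[w] for w in words if w in sad_log_probs)
    # equivalent to exp(happy) / (exp(happy) + exp(sad))
    prob_happy = 1.0 / (math.exp(sad - happy) + 1.0)
    return prob_happy, 1.0 - prob_happy

p_happy, p_sad = classify(['love', 'flight'])  # 'love' pulls the tweet toward happy
```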
Load the word list:
# load list of words and log probs
happy_log_probs, sad_log_probs = readSentimentList('wordlist.csv')
Reading in a sample tweet:
# read tweet
tweet1 = ['I', 'hate', 'southwest']
# calculate the probability
tweet1_happy_prob, tweet1_sad_prob = classifySentiment(tweet1, happy_log_probs, sad_log_probs)
print tweet1
print "happy probability: " , tweet1_happy_prob
print "sad probability:", tweet1_sad_prob
# get words out for sentiment analysis
puncs = '.,;:!?()[]{}`''\"@#$^&*+-|=~_'
sentiment_words = tweets_text.mapValues(lambda t: t.strip(puncs).split())
# classify sentiment of tweet
tweets_probs = sentiment_words.mapValues(lambda ws: classifySentiment(ws, happy_log_probs, sad_log_probs))
happy_probs = tweets_probs.mapValues(lambda (hprob, sprob): (hprob, 1))
sad_probs = tweets_probs.mapValues(lambda (hprob, sprob): (sprob, 1))
Create a dataframe to use for visualization.
#collect the happy probabilities from the above mapValues call
listOfHappyProbs = happy_probs.collect()
#create an empty list to contain the just the probability portion of happy_probs
HappyProbsList = []
for x in range(len(listOfHappyProbs)):
    HappyProbsList.append(listOfHappyProbs[x][1][0])
#create the following dataframe containing columns of data for visualization
# text, airline, positive, prob, created_at, favorite_count, retweet_count, followers_count
df = pd.DataFrame()
df['created_at'] = processed_data['created_at']
df['favorite_count'] = processed_data['favorite_count']
df['retweet_count'] = processed_data['retweet_count']
df['followers_count'] = processed_data['followers_count']
# df_data_visualization['text'] = processed_data['text']
df['airline'] = processed_data['airline']
df['prob'] = HappyProbsList
#a function to set a 1 if the happy probability is greater than .5 otherwise set it to 0
#we will use this for the visualization aspect
def setPositiveValues(row):
    if row['prob'] > 0.5:
        return 1
    else:
        return 0
#run the function on the prob column to create the positive column to determine if a tweet is positive or negative
#from the happy probability
df['positive'] = df.apply(setPositiveValues, axis=1)
#create the text column, remove the newlines from the text to resolve an issue with row creation in the .to_csv call
df['text'] = processed_data['text'].str.replace('\n',"")
#create a function to set the airline names for the visualizations. We wanted the names to look clean on the graphs:
def setChartNames(row):
    if row == 'united':
        return 'United'
    elif row == 'southwest':
        return 'Southwest'
    elif row == 'delta':
        return 'Delta'
    elif row == 'jetblue':
        return 'JetBlue'
    elif row == 'flyfrontier':
        return 'Frontier'
    elif row == 'hawaiianair':
        return 'Hawaiian'
    elif row == 'virginamerica':
        return 'Virgin'
    elif row == 'alaskaair':
        return 'Alaska'
    elif row == 'spiritairlines':
        return 'Spirit'
    elif row == 'AmericanAir':
        return 'American'
#set the airline names using the above function
df['airline'] = df['airline'].apply(setChartNames)
#set the created_at column to a pandas datetime column
df['created_at'] = pd.to_datetime(df['created_at'], dayfirst=True )
#set the format of the dates to look clean on the visualizations to come
df['created_at'] = df['created_at'].map(lambda x: x.strftime('%m-%d-%Y'))
#create a .csv file to use for visualization
df.to_csv('output.csv', encoding='utf-8', index=False)
Here we start with the visualization. But first, we need to remove the undecided tweets.
#getting rid of undecided tweets
df = df[(df.prob != .5)]
Some of the positive tweets:
df[(df.positive == 1)].head(5)
Some of the negative tweets:
df[(df.positive == 0)].head(5)
We can aggregate all the happy & sad probabilities for individual tweets about each airline into an average probability that users are happy or sad about that airline:
airline_happy_probs = happy_probs.map(lambda ((idx, air), probs): (air, probs)).reduceByKey(lambda (p1,num1),(p2,num2): ((num1*p1 + num2*p2)/(num1 + num2), num1 + num2)).mapValues(lambda (p, n): p)
airline_sad_probs = sad_probs.map(lambda ((idx, air), probs): (air, probs)).reduceByKey(lambda (p1,num1),(p2,num2): ((num1*p1 + num2*p2)/(num1 + num2), num1 + num2)).mapValues(lambda (p, n): p)
print "Happy probabilities:", airline_happy_probs.take(5)
print "Sad probabilities:", airline_sad_probs.take(5)
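The reduceByKey above carries a (mean, count) pair so that two partial averages can be merged without revisiting the raw values; merging this way gives exactly the overall mean. A Spark-free sketch of the combine step:

```python
from functools import reduce

def combine(a, b):
    # merge two (mean, count) partial aggregates into one
    (p1, n1), (p2, n2) = a, b
    n = n1 + n2
    return ((n1 * p1 + n2 * p2) / n, n)

probs = [0.9, 0.2, 0.7, 0.4]             # happy probabilities for one airline
partials = [(p, 1) for p in probs]       # each tweet starts as (prob, count=1)
mean, count = reduce(combine, partials)  # same as sum(probs) / len(probs)
```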
We can group the dataframe by airlines. Showing number of Tweets per airline:
airline_count = df.groupby(['airline']).count().prob.sort(axis=0, ascending=False, inplace=False)
airline_count.iplot(kind='bar', yTitle='Number of Tweets', title='Number of Tweets Per Airline')
tls.embed('https://plot.ly/~ayinmv/112')
We score the airlines based on the average of their positive probabilities:
airline_count = df.groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
airline_count.iplot(kind='bar', yTitle='Average Score', title='Average Score')
tls.embed('https://plot.ly/~ayinmv/63')
To see the scores in a more distinct way, we subtract the overall average score from each airline's score and plot the differences on a bar chart:
# Learn about API authentication here: https://plot.ly/pandas/getting-started
# Find your api_key here: https://plot.ly/settings/api
import plotly.plotly as py
import plotly.graph_objs as go
#making this dataframe to set the labels
dfx = pd.DataFrame([airline_count]).transpose().reset_index(level=0)
data = [
go.Bar(
x=dfx['airline'], # assign x as the dataframe column 'x'
y=airline_count - np.mean(airline_count)
)
]
# IPython notebook
df.iplot(data, yTitle='Average Score', title='Score Difference from Average Score')
tls.embed('https://plot.ly/~ayinmv/123')
Here we explore the opportunity to surface the most important tweets. For example, we can sort by the tweets whose authors have the most followers:
pd.set_option('max_colwidth', 200)
#showing a subset of dataframe
df.sort(columns='followers_count', ascending=[False])[['followers_count','text', 'prob', 'created_at']].head(5)
We can take a better look at the data with quantile plots, using the per-airline groupby output data.
# quantile plot for each airline
dfgroupby = pd.read_csv("outputgroupby.csv")
dfgroupby.iplot(kind='box', title='Score Quantiles Graph')
tls.embed('https://plot.ly/~ayinmv/147')
Here we would like to see whether users with the highest follower counts post unusually positive sentiment. We suspect that many of the most popular users have an incentive to post positive tweets.
#making the data frame of categories of follower counts
dffollow = pd.DataFrame()
dffollow['100']=df[(df.followers_count > 100)].groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
dffollow['1K']=df[(df.followers_count > 1000)].groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
dffollow['10K']=df[(df.followers_count > 10000)].groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
dffollow['100K']=df[(df.followers_count > 100000)].groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
dffollow['1M']=df[(df.followers_count > 1000000)].groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
dffollow
# quantile plot for each airline
dffollow.iplot(kind='box', title='Score Quantiles')
tls.embed('https://plot.ly/~ayinmv/83')
We did not see a clear trend supporting our hypothesis about the most popular users. The users that have some sort of incentive to post unusually positive tweets may represent only a small sample of the population.
LDA, or Latent Dirichlet Allocation, is a topic modeling algorithm that, given a corpus of documents, generates a set of topics associated with related words, and then assigns each document a likelihood of belonging to each topic using information about the corpus as well as random seeding. LDA determines how closely related two words are on the basis of their co-occurrence in one or more documents. However, words that do not appear in any documents together can end up associated with the same topic, if they both co-occur with another word. Even longer chains can also form topics – i.e., word A co-occurs with B, which co-occurs with C, which co-occurs with D, which co-occurs with E, and A through E could end up all in the same topic even though A and E neither co-occur nor share a word they co-occur with.
Note also that because LDA is a stochastic model, if you re-run the rest of this notebook, your results may be different from ours. The explanations accompanying our results are specific to a particular run of the notebook and will not describe the results you see if you re-run the notebook. However, the underlying principles should generalize to any run of the notebook.
We separate out the nouns from each tweet to feed into our LDA model, and create a gensim dictionary of all the nouns. We don't want to deal with misspellings or other terms that appear only once or twice in our dataset, so we filter terms that appear too few times out of our dictionary with dictionary.filter_extremes(), and then use dictionary.compactify() to remove gaps in the indices of the dictionary left by removing those terms.
The function dictionary.filter_extremes() also removes any terms that appear in more than half of all documents in the dataset, but these are generally stopwords (i.e., extremely common and uninformative words like "and" or "the"), which we have already removed in a previous step.
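The rare-term filtering can be sketched without gensim: compute each term's document frequency, drop terms below a threshold, and reassign compact integer ids. The threshold here is hypothetical (gensim's filter_extremes defaults to no_below=5 and also applies the no_above document-frequency ceiling mentioned above):

```python
from collections import Counter

docs = [['bag', 'flight'], ['flight', 'seat'], ['flight', 'bag'], ['zzyzx']]

# document frequency: number of docs each term appears in
doc_freq = Counter(term for doc in docs for term in set(doc))

no_below = 2  # hypothetical threshold (gensim's default is 5)
kept = sorted(term for term, n in doc_freq.items() if n >= no_below)

# reassign compact, gap-free ids, as dictionary.compactify() does
token2id = {term: i for i, term in enumerate(kept)}
# 'seat' and the misspelling-like 'zzyzx' appear only once, so they are dropped
```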
from operator import add
# parse nouns out of tweets
tweets_n_a = tweets_text.mapValues(get_parts)
tweets_nouns = tweets_n_a.mapValues(lambda (n, adj): n)
print tweets_nouns.take(5)
all_nouns = tweets_nouns.flatMapValues(lambda l: l).values().toLocalIterator()
# feed nouns into gensim
dictionary = corpora.Dictionary(all_nouns)
dictionary.filter_extremes()
dictionary.compactify()
We then vectorize the nouns in each tweet (the first map is just to get our input in the form that gensim's doc2bow function wants, while the filter removes tweets that didn't contain any identifiable nouns).
doc_vecs = tweets_nouns.mapValues(lambda n: reduce(add, n, [])).filter(lambda (k, v): v).mapValues(dictionary.doc2bow)
corpus = doc_vecs.values().collect()
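gensim's doc2bow maps a token list to sorted (token_id, count) pairs, silently ignoring tokens not in the dictionary; a minimal pure-Python stand-in (with a hypothetical dictionary):

```python
from collections import Counter

token2id = {'flight': 0, 'bag': 1, 'seat': 2}  # hypothetical dictionary

def doc2bow(tokens):
    # count only in-vocabulary tokens, then emit (id, count) pairs sorted by id
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

vec = doc2bow(['bag', 'flight', 'bag', 'unknownword'])  # [(0, 1), (1, 2)]
```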
We use Latent Dirichlet Allocation (LDA) from gensim to find latent topics within our tweets. LDA maps our corpus of documents (in this case, tweets) from the high-dimensional space of bag-of-words vectors (in which each unique token can be thought of as a "dimension") to a lower-dimensional space of a specific number of topics.
len(dictionary.keys())
NUM_TOPICS = 100
So in this case, our corpus represented as bag-of-words vectors can be thought of as a 2442-dimensional space, which is very difficult to work with to determine document similarities and differences. Instead, we will represent it as a 100-dimensional space of LDA topics, and in particular we will focus on just fifteen of the "best" topics returned by our LDA model. This will make it much more manageable to come to conclusions about the content of our documents.
from gensim.models.ldamodel import LdaModel
lda = LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=2)
Printing the top topics, we can see the top terms associated with each topic and the "coherence score" of the topic, a measure of how closely related the terms in each topic are.
The coherence score is calculated as: $$C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} log \frac {D\left(v^{(t)}_m, v^{(t)}_l\right) + 1} {D\left(v^{(t)}_l\right)}$$ where $V^{(t)} = \left(v^{(t)}_1, ..., v^{(t)}_M\right)$ is a list of the $M$ most probable words for topic $t$, $D(v)$ is the document frequency of the word v, and $D(v, v')$ is the co-document frequency of the words $v$ and $v'$, i.e., the number of documents in which both words appear (see the original paper for more details).
The coherence score thus measures roughly how likely it is that the words associated with a given topic are actually conceptually related to each other. The absolute score isn't terribly useful on its own, since the range of this function depends on the size of the corpus, length of documents, etc., but it's a useful way of ranking topics relative to each other on how likely they are to be informative. We thus take the top topics by coherence score to perform further analysis on.
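This (UMass) coherence measure can be implemented directly from the formula; a self-contained sketch on a hypothetical toy corpus of tokenized tweets:

```python
import math

def coherence(top_words, docs):
    """C(t; V) = sum over pairs (m > l) of log((D(v_m, v_l) + 1) / D(v_l))."""
    def doc_freq(*words):
        # number of documents containing all the given words
        return sum(1 for doc in docs if all(w in doc for w in words))
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1.0)
                              / doc_freq(top_words[l]))
    return score

docs = [{'ticket', 'website'}, {'ticket', 'website'}, {'gate'}]
# 'ticket' and 'website' always co-occur: D(v_m, v_l) = 2, D(v_l) = 2 -> log(3/2)
score = coherence(['ticket', 'website'], docs)
```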
best_topics = lda.top_topics(corpus)[:15]
for idx, tpc in enumerate(best_topics):
    print "Topic", idx, ":"
    print " score:", tpc[1]
    print " terms:", tpc[0][:10]
Even after sorting by coherence score, many of these topics do not obviously correspond to a particular human-identifiable concept. Because tweets are so short, LDA has trouble coming up with enough information to assign terms to different topics in a way consistent with human understanding. However, we do have Topic 0, which includes "check-in", "counter", "cart", and "efficiency" among its highest-probability terms, and Topic 13, which includes "ticket" and "website". These give some indication of possible topics we might be interested in.
These ranks for the topics based on coherence score aren't the internal topic ids that gensim uses, though, so we have to find these internal ids for each of the topics in question.
all_topics = lda.show_topics(NUM_TOPICS, formatted=False)
def get_topic_id(topic, all_topics):
    """
    Matches the topics returned by top_topics to their ids in the LDA model
    by checking for term overlap.
    """
    for tpc in all_topics:
        if len(set([t[0] for t in tpc[1]]) & set([t[1] for t in topic[0]])) == 10:
            return tpc[0]
best_topics_ids = [get_topic_id(topic, all_topics) for topic in best_topics]
for idx, tpc in enumerate(best_topics_ids):
    print "Topic", idx, "above has ID", tpc
We can use these IDs to figure out which of the above topics each tweet is most associated with.
Now that we have a bunch of topics and we've assigned each tweet a probability of being "happy" or "sad" (i.e., expressing positive or negative sentiment), we can calculate what topics are most associated with positive or negative tweets.
We start by determining how likely each tweet is to be associated with each of our best topics.
def get_best_topics(bow):
    my_topics = lda.get_document_topics(bow)
    my_best_topics = [tpc for tpc in my_topics if tpc[0] in best_topics_ids]
    return my_best_topics
We only keep tweets that have a non-negligible likelihood of belonging to one of our top topics (using the filter statement below).
tweets_topics = doc_vecs.mapValues(get_best_topics).filter(lambda (k,v): v)
tweets_topics.take(5)
We can pull out the correlation scores of each tweet to each topic and then plot a histogram for each topic showing its distribution of correlation scores.
# Get a list of all the tweets' correlation scores to each of our topics
scores_per_topic = tweets_topics.values().flatMap(lambda v: v).mapValues(lambda v: [v]).reduceByKey(add)
# Store it as a dict, keyed by the topic ID
hist_input = scores_per_topic.collectAsMap()
fig, axes = plt.subplots(5,3, sharex=True)
fig.set_size_inches(16, 12)
for idx, ax in enumerate(axes.ravel()):
    ax.hist(hist_input[best_topics_ids[idx]], 20, range=[0,1])
    ax.set_title('Topic {} (id {}): {} total tweets'.format(idx, best_topics_ids[idx], len(hist_input[best_topics_ids[idx]])))
For most topics, there are many tweets that are a little bit related to the topic, and fewer that the model says are very closely related to the topic. There also appears to be a disproportionate concentration of correlation scores in the 0.5 bin, particularly compared to the bins immediately surrounding it. This is likely due to a mathematical quirk of LDA, which is particularly visible because tweets are very short documents.
In general, the LDA topic model is not hugely confident in its assignment of individual tweets to topics. This is not surprising, because an individual tweet contains very few words (and even fewer nouns recognizable by the part-of-speech tagger, particularly given that tweets do not always follow standard grammatical structure), so there is not very much basis on which to assign a tweet to one topic or another. One way to improve this would be to use a part-of-speech tagger trained on a Twitter corpus, so that more information could be extracted from each tweet. However, 140 characters is never going to contain a huge number of words for the LDA model to base its topic assignment on, so automatic topic assignments of tweets are inherently going to be less certain than automatic topic assignments of longer documents. Even though our topic results may be somewhat difficult to interpret, they still illustrate how LDA can be used to break down a complex corpus into features, which can then be used to break down sentiment analysis by topic.
Next, we combine our topic RDD with our sentiment RDD to convert the topics and sentiment scores associated with each tweet into sentiment scores associated with each topic, and with each topic-airline combination. We will weight a tweet's contribution to each topic's sentiment score by its correlation score to that topic.
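As a concrete worked example of the weighting scheme (all numbers hypothetical): a tweet with correlation 0.8 to a topic should pull that topic's average sentiment toward its own scores four times as strongly as a tweet with correlation 0.2.

```python
# (correlation, P(happy), P(sad)) per tweet -- made-up numbers for illustration
tweets = [(0.8, 0.9, 0.1),
          (0.2, 0.3, 0.7)]
total = sum(w for w, _, _ in tweets)
happy = sum(w * hp for w, hp, _ in tweets) / total
sad = sum(w * sp for w, _, sp in tweets) / total
print(round(happy, 2))  # 0.78
print(round(sad, 2))    # 0.22
```

The weakly-correlated tweet barely moves the topic's average away from the strongly-correlated one.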
# Join topics and sentiment probabilities by key
combined = tweets_topics.join(tweets_probs)
combined.take(5)
We then rearrange the data to make it possible to average sentiment contributions across all tweets associated with a given topic.
def regroup_by_topic(topics_probs, happy_prob, sad_prob):
    topic_sentiments = []
    for topic, prob in topics_probs:
        topic_sentiments.append((topic, (prob, happy_prob, sad_prob)))
    return topic_sentiments
regrouped = combined.mapValues(lambda (tps, (hp, sp)): regroup_by_topic(tps, hp, sp)).flatMapValues(lambda v: v)
regrouped.take(5)
Now we combine the probabilities per topic to find out what airline-related topics people on Twitter are most positive or negative about.
weighted_avg_probs = lambda (p1, hp1, sp1), (p2, hp2, sp2): (p1 + p2, (p1*hp1 + p2*hp2)/(p1 + p2), (p1*sp1 + p2*sp2)/(p1 + p2))
happy_sad_probs = lambda (p, hp, sp): (hp, sp)
topics_sentiments = regrouped.values().reduceByKey(weighted_avg_probs).mapValues(happy_sad_probs)
print topics_sentiments.take(5)
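Because reduceByKey applies weighted_avg_probs pairwise in arbitrary order, each combined value has to carry the running total weight along. A pure-Python sketch (hypothetical triples) showing that the incremental combine matches a direct weighted average:

```python
from functools import reduce  # needed on Python 3; functools.reduce also exists in Python 2

def combine(a, b):
    # Each value is (weight, P(happy), P(sad)); merge two running averages,
    # keeping the summed weight so later combines stay correct.
    (p1, h1, s1), (p2, h2, s2) = a, b
    p = p1 + p2
    return (p, (p1*h1 + p2*h2) / p, (p1*s1 + p2*s2) / p)

triples = [(0.5, 0.9, 0.1), (0.3, 0.4, 0.6), (0.2, 0.7, 0.3)]
w, happy, sad = reduce(combine, triples)
print(round(happy, 2))  # 0.71
print(round(sad, 2))    # 0.29
```

Carrying the weight makes the combine associative, which is what reduceByKey requires to merge partial results across partitions.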
topic_sent_dict = topics_sentiments.collectAsMap()
index = np.arange(len(best_topics_ids))
happy_probs_by_topic = np.array([topic_sent_dict[idx][0] for idx in best_topics_ids])
order = np.argsort(happy_probs_by_topic)[::-1]
width = 0.5
fig, ax = plt.subplots(figsize=(16,8))
plt.bar(index, happy_probs_by_topic[order], width, alpha=1, color='violet')
# Setting axis labels and ticks
ax.set_xlim((-width, len(best_topics_ids)))
ax.set_ylabel('Happy Probability', fontsize=15)
ax.set_xlabel('Topic ID', fontsize=15)
ax.set_title('Twitter Happiness by Topic', fontsize=15, fontweight='bold')
ax.set_xticks([i + 0.5 * width for i in index])
ax.set_xticklabels(np.array(best_topics_ids)[order])
plt.grid()
plt.show()
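The happiest-first ordering of the bars above comes from reversing argsort's ascending order; a pure-Python sketch of the same trick with toy probabilities:

```python
# Indices that would sort the list, in descending (happiest-first) order
probs = [0.6, 0.9, 0.7]
order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
print(order)                      # [1, 2, 0]
print([probs[i] for i in order])  # [0.9, 0.7, 0.6]
```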
The airline-related topic that Twitter users are most positive about overall is the topic with ID 37, which is characterized by the terms "priority", "effort", and "transit". These tweets are likely about general experiences with the airline, particularly priority programs and airline workers.
The airline-related topic that Twitter users are least positive about overall is the topic with ID 62, which is characterized by the terms "lack", "size", and "respect", as well as picking up the specific airline-related terms of "deltaairline" and "deltaone". The low happiness associated with this category is likely a reflection of negative tweets across the board about concerns regarding insufficient space on planes and insufficient respect for passengers, but may also reflect pockets of frustration with Delta, despite the fact that Delta received a higher average happiness score overall than any other airline.
Another comparatively negative topic is ID 75, characterized by "courtesy", "standing", and "struggle". This category may be more difficult to interpret as a human-understandable concept due to its lower coherence score, but we could infer that tweets about encounters with airline personnel or other passengers might be found in this category.
We're less interested in what people think of different aspects of airlines overall than in how individual airlines score in each of these categories, though. Let's re-group our data by airline and topic and see how each airline stacks up on each topic.
As above, we work with the joined RDD and extract weighted happy and sad probabilities, but this time we use (airline, topic) pairs as our keys rather than topics alone. This breaks down how happy or sad Twitter users are about each topic specifically with regard to each airline.
airline_topics_sentiments = regrouped.map(lambda ((idx, air), (tpc, probs)): ((air, tpc), probs)).reduceByKey(weighted_avg_probs).mapValues(happy_sad_probs)
print airline_topics_sentiments.take(5)
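The re-keying in the map above can be sketched in plain Python with toy records (hypothetical values): each ((tweet_index, airline), (topic, probs)) record is re-keyed by the (airline, topic) pair, dropping the tweet index.

```python
records = [((0, "Delta"), (37, (0.6, 0.9, 0.1))),
           ((1, "United"), (37, (0.4, 0.3, 0.7)))]
# Drop the tweet index and key by (airline, topic) instead
rekeyed = [((air, tpc), probs) for ((idx, air), (tpc, probs)) in records]
print(rekeyed[0][0])  # ('Delta', 37)
```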
air_tpc_sent_dict = airline_topics_sentiments.collectAsMap()
We can graph the average happiness associated with tweets about each airline in each topic, again weighted by each tweet's degree of association with the topic in question. Every airline except Hawaiian Air has at least one tweet associated with each topic.
index = np.arange(len(airlines))
colors = ['#975A7A', '#78C0E0', '#FFB3AA', '#4F86C6', '#17A398', '#EE8434', '#FFD5FF', '#FFBC00', '#009FFD', '#8565A0']
fig, axes = plt.subplots(5, 3, figsize=(16,16), sharex=True, sharey=True)
for tpc, ax in zip(best_topics_ids, axes.ravel()):
    happy_probs_by_airline = [air_tpc_sent_dict.get((air, tpc), [0])[0] for air in airlines]
    width = 0.5
    barlist = ax.bar(index, happy_probs_by_airline, width, alpha=1)
    for bar, color in zip(barlist, colors):
        bar.set_color(color)
    # Setting axis labels and ticks
    ax.set_xlim((-width, len(airlines)))
    ax.set_ylabel('Happy Probability')
    ax.set_xlabel('Airline')
    ax.set_title('Topic ID {}'.format(tpc))
    ax.set_xticks([])
fig.legend(barlist, airlines, loc=(0.2,0.94), ncol=5)
plt.show()
The missing bars for Hawaiian Air in topics 78 and 37 indicate no data (not zero probability of happiness).
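Those gaps come from the dict.get fallback in the plotting loop: missing (airline, topic) keys return the default [0], and indexing [0] then yields a zero-height bar instead of a KeyError. A minimal sketch with made-up values:

```python
# Hypothetical sentiment dict keyed by (airline, topic) pairs
sent = {("Delta", 37): (0.82, 0.18)}
print(sent.get(("Delta", 37), [0])[0])        # 0.82
print(sent.get(("HawaiianAir", 37), [0])[0])  # 0
```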
We can see that even per topic, most tweets are generally rated positive by our sentiment analysis. However, there are some topics where a few airlines stand out as substantially better or worse than others. In particular, Southwest ranks poorly on topics 78 ("snack", "check-in", "bonus", "cart", "counter") and 48 ("fence", "pat", "shoe", "pre-check") but is otherwise associated with as much or more happiness than other airlines. This might indicate that customers are unsatisfied with Southwest's check-in process and experiences going through security, but otherwise pretty happy with the airline.
Hawaiian Air ranks quite poorly on topic 54, which could be an indictment of their snacks since that topic contains "peanut" and "chip", but it could also be a legal problem or a problem with their treatment of military passengers since the topic also contains the terms "soldier" and "lawsuit". Again, the uncertainty in the topics generated by LDA on our dataset is a function of how short tweets are, meaning that they don't provide very much information about what terms are related to each other to form topics in the first place, and also don't provide much information about what topics they might belong to.
Overall, we found that even though our dataset of tweets is not optimal for generating highly informative LDA topics, breaking down our sentiment analysis by topic shows us underlying differences in sentiment between airlines in different aspects of their service. With a larger dataset, we might be able to offset the shortness of individual tweets to generate more useful topics. However, longer documents – e.g., Facebook posts about airlines – would be even more likely to reveal useful categories that people discuss with regards to airlines.
If we had more time, we would have liked to try the following:
Given the time constraint, our goal was to collect at least 100,000 tweets. We accomplished this, but we would have liked to collect a much larger dataset, which would have yielded more robust algorithmic performance and better topics from the LDA process.
We would like to do more work on the Naive Bayes probabilities dataset in order to improve the accuracy of the sentiment analysis algorithm.
We plan to create a word set of 'amplifiers' that exaggerate the positivity or negativity of tweets containing certain words. For example, 'incredible' could push a tweet further toward positive or negative.
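One possible (entirely hypothetical) sketch of this amplifier idea: scale a tweet's sentiment score away from the neutral point of 0.5 when an amplifier word appears. The word list and weights below are invented for illustration.

```python
AMPLIFIERS = {"incredible": 1.5, "absolutely": 1.3}  # invented weights

def amplify(score, words):
    # Strongest amplifier present (1.0 if none) scales the distance from
    # neutral (0.5), clipped back into [0, 1]
    factor = max(AMPLIFIERS.get(w.lower(), 1.0) for w in words)
    return min(1.0, max(0.0, 0.5 + (score - 0.5) * factor))

print(round(amplify(0.7, ["incredible", "flight"]), 2))  # 0.8
print(round(amplify(0.3, ["terrible", "delay"]), 2))     # 0.3
```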
We would like to work towards a highly scalable version of this workflow, which would be needed for larger datasets.
We would like to be able to weight tweets based on sentiment analysis of a user's past Twitter history. This would contribute to a better learning model.